Pinterest's Technical Architecture
Scaling Pinterest
Marty Weiner, Cloud Ninja
Yash Nelapati, Ascii Artist
Monday, November 11, 13
Scaling Pinterest
Evolution
March 2010
Growth
• RackSpace
• 1 small Web Engine
• 1 small MySQL DB
• 1 Engineer + 2 Founders
[Chart: page views per day, Mar 2010 to May 2012]
January 2011
Growth
• Amazon EC2 + S3 + CloudFront
• 1 NGinX, 4 Web Engines
• 1 MySQL DB + 1 Read Slave
• 1 Task Queue + 2 Task Processors
• 1 MongoDB
• 2 Engineers + 2 Founders
[Chart: page views per day, Mar 2010 to Jan 2012]
September 2011
Growth
• Amazon EC2 + S3 + CloudFront
• 2 NGinX, 16 Web Engines + 2 API Engines
• 5 Functionally Sharded MySQL DBs + 9 read slaves
• 4 Cassandra Nodes
• 15 Membase Nodes (3 separate clusters)
• 8 Memcache Nodes
• 10 Redis Nodes
• 3 Task Routers + 4 Task Processors
• 4 Elastic Search Nodes
• 3 Mongo Clusters
• 3 Engineers (8 Total)
[Chart: page views per day, Mar 2010 to May 2012]
It will fail. Keep it simple.
If you’re the biggest user of a technology, the challenges will be greatly amplified.
January 2012
Growth
April 2012
Growth
• Amazon EC2 + S3 + Edge Cast
• 135 Web Engines + 75 API Engines
• 10 Service Instances
• 80 MySQL DBs (m1.xlarge) + 1 slave each
• 110 Redis Instances
• 60 Memcache Instances
• 2 Redis Task Managers + 60 Task Processors
• 3rd party sharded Solr
[Chart: page views per day, Mar 2010 to May 2012]
April 2012
Growth
• 12 Engineers
  • 1 Data Infrastructure
  • 1 Ops
  • 2 Mobile
  • 8 Generalists
• 10 Non-Engineers
[Chart: page views per day, Mar 2010 to May 2012]
April 2013
Growth
• Amazon EC2 + S3 + Edge Cast
• 400+ Web Engines + 400+ API Engines
• 70+ MySQL DBs (hi1.4xlarge on SSDs) + 1 slave each
• 100+ Redis Instances
• 230+ Memcache Instances
• 10 Redis Task Managers + 500 Task Processors
• 65+ Engineers (130+ total)
• 8 services (80 instances)
• Sharded Solr
• 20 HBase
• 12 Kafka + Azkaban
• 8 Zookeeper Instances
• 12 Varnish
[Chart: page views per day, April 2012 to April 2013]
April 2013
Growth
• 65+ Engineers
  • 7 Data Infrastructure + Science
  • 7 Search and Discovery
  • 9 Business and Platform
  • 6 Spam, Abuse, Security
  • 9 Web
  • 9 Mobile
  • 2 Growth
  • 10 Infrastructure
  • 6 Ops
• 65+ Non-Engineers
[Chart: page views per day, April 2012 to April 2013]
Technologies
Arch Overview
[Diagram: components listed below; all connection pairings managed by ZooKeeper]
• ELB
• Routing & Filtering (Varnish)
• Web App (Python)
• API App (Python / JS / HTML)
• Task Processing (Python/Pyres)
• MySQL Service (Java/Finagle)
• Memcache Mux (Nutcracker)
• Follower Service (Python/Thrift)
• Feed Service (Python/Thrift)
• Search Service (Python/Thrift)
• Spam Service (Python/Thrift)
• Images (S3 + CDN)
• Sharded MySQL
• Memcache
• Redis
• HBase
• Task Queue (Redis)
• Puppet, StatsD, Monit, Sensu
Data Pipeline
[Diagram: components listed below]
• Web App (Python)
• API App (Python)
• Task Processing (Python/Pyres)
• Kafka
• S3 Copier
• S3
• Pinball
• Qubole
• Redshift
• Tripwire (Spam)
Web App
• NGinX
• Website Rendering (x8) (Python / JS / HTML)
• API
Choosing Your Tech
Questions to ask
• Does it meet your needs?
• How mature is the product?
• Is it commonly used? Can you hire people who have used it?
• Is the community active?
• How robust is it to failure?
• How well does it scale? Will you be the biggest user?
• Does it have good debugging tools? Profiler? Backup software?
• Is the cost justified?
Hosting
Why Amazon Web Services (AWS)?
• Variety of servers running Linux
• Very good peripherals: load balancing, DNS, map reduce, basic security, and more
• Good reliability
• Very active dev community
• Not cheap, but...
• New instances ready in seconds
AWS Usage
• Route 53 for DNS
• ELB for 1st tier load balancing
• EC2 Ubuntu Linux
  • Varnish layer
  • All web, API, background appliances
  • All services
  • All databases and caches
• S3 for images, logs
Code
Why Python?
• Extremely mature
• Well known and well liked
• Solid active community
• Very good libraries specifically targeted to web development
• Effective rapid prototyping
• Open Source
Some Java and Go...
• Faster, lower variance response time
Python Usage
• All web backend, API, and related business logic
• Most services
Java and Go Usage
• Varnish plugins
• Search indexers
• High frequency services (e.g., MySQL service)
Production Data
Why MySQL and Memcache?
• Extremely mature
• Well known and well liked
• (MySQL) Rarely catastrophic loss of data
• Response time to request rate increases linearly
• Very good software support: XtraBackup, Innotop, Maatkit
• Solid active community
• Open Source
MySQL and Memcache Usage
• Storage / Caching of core data
• Users, boards, pins, comments, domains
• Mappings (e.g., users to boards, user likes, repin info)
• Legal compliance data
Why Redis?
• Well known and well liked
• Active community
• Consistently good performance
• Variety of convenient and efficient data structures
• 3 Flavors of Persistence: Now, Snapshot, Never
• Open Source
Redis Usage
• Follower data
• Configurations
• Public feed pin IDs
• Caching of various core mappings (e.g., board to pins)
Why HBase?
• Small, but growing loyal community
• Difficult to hire for, but...
• Non-volatile, O(1), extremely fast and efficient storage
• Strong Hadoop integration
• Consistently good performance
• Used by Facebook (bigger than us)
• Seems to work well
• Open Source
HBase Usage
• User feeds (pin IDs are pushed to feeds)
• Rich pin details
• Spam features
• User relationships to pins
What happened to Cassandra, Mongo, ES, and Membase?
• Does it meet your needs?
• How mature is the product?
• Is it commonly used? Can you hire people who have used it?
• Is the community active? Can you get help?
• How robust is it to failure?
• How well does it scale? Will you be the biggest user?
• Does it have good debugging tools? Profiler? Backup software?
• Is the cost justified?
A 2nd chance...
Stuff we could have done better
• Logging on day 1 (StatsD, Kafka, Map Reduce)
• Log every request, event, signup
• Basic analytics
• Recovery from data corruption or failure
• Alerting on day 1
Stuff we could have done better
• Shard our MySQL storage much earlier
• Once you start relying on read slaves, start the timebomb countdown
• We also fell into the NoSQL trap (Membase, Cassandra, Mongo, etc.)
• Pyres for background tasks on day 1
• Hire technical operations engineers earlier
• Chef / Puppet earlier
• Unit testing earlier (Jenkins for builds)
Stuff we could have done better
• A/B testing earlier
• Decider on top of Zookeeper WATCH
• Progressive roll out
• Kill switches
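The decider, progressive roll out, and kill switch bullets can be sketched together. This is only an illustration under stated assumptions, not Pinterest's implementation: the `Decider` class, the `on_config_change` callback (standing in for a ZooKeeper watch firing), and the `new_feed` feature name are all hypothetical, and the percentages live in a plain dict rather than in ZooKeeper.

```python
import zlib


class Decider:
    """Hypothetical feature gate; values would be pushed by a ZooKeeper watch."""

    def __init__(self):
        # feature name -> percentage of users (0-100) that see the feature
        self._percent = {}

    def on_config_change(self, feature, percent):
        # Stand-in for the watch callback that fires when the stored value changes.
        self._percent[feature] = max(0, min(100, percent))

    def is_enabled(self, feature, user_id):
        # Hash feature + user so each user lands in a stable bucket per feature.
        bucket = zlib.crc32(f"{feature}:{user_id}".encode()) % 100
        return bucket < self._percent.get(feature, 0)


decider = Decider()
decider.on_config_change("new_feed", 10)   # progressive roll out: about 10% of users
rolled_out = sum(decider.is_enabled("new_feed", uid) for uid in range(10_000))
decider.on_config_change("new_feed", 0)    # kill switch: off for everyone
```

Because the bucket depends only on the feature name and user ID, raising the percentage keeps already-enabled users enabled, which is one plausible way to get a progressive roll out from a single number.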
Looking Forward
• Continually improve Pinner experience
• Help Pinners discover more of the things they love
• Better uptime and lower latency
• Faster development times
• Reduce spam and abuse
• Continually improve collaboration and build bigger, better, faster products
• 180 Pinployees and beyond
What’s next?
Have fun
If I could do it all over again...
• Stronger ACID transactional guarantees across multiple systems
• Currently have: sometimes A, best effort C, I, D, no silent failure
• Want: sometimes A, eventual C, I, D, no silent completion
My 2nd Chance
Transactional tasks
• All tasks become a dependency tree of repeatable synchronous or asynchronous actions
• All actions must be repeatable
• Otherwise, must add repeatability
• All tasks get a unique transaction number
• Counters are tricky
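One way to see why counters are tricky: a plain increment is not repeatable, so a retried task would count twice. The sketch below uses the task's transaction number to make the increment safe to replay. The `RepeatableCounter` class and the `txn-…` identifiers are hypothetical, and a real system would need the "seen" record and the counter to be updated atomically and the record to expire eventually.

```python
class RepeatableCounter:
    """Counter whose increments are safe to replay (hypothetical sketch)."""

    def __init__(self):
        self.value = 0
        self._applied = set()  # transaction numbers already counted

    def increment(self, txn_id):
        # A bare `value += 1` would double count when a task is retried.
        # Remembering the transaction number makes the action repeatable.
        if txn_id in self._applied:
            return False
        self._applied.add(txn_id)
        self.value += 1
        return True


repin_count = RepeatableCounter()
repin_count.increment("txn-1001")
repin_count.increment("txn-1001")  # retry of the same task: ignored
repin_count.increment("txn-1002")
```

After these three calls the counter holds 2, not 3, because the replayed transaction is recognized and skipped.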
Transactional tasks
• All tasks become a dependency tree of repeatable synchronous or asynchronous actions
• Sync actions are executed in order
• Async actions are executed in any order
• Repeat until successful or too many failures
• Too many failures -> put in per task failure queue
• Gives eventual C, I, D
• No silent completion and A require extra effort
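The retry rules above can be sketched as a small task runner. This is a minimal illustration, not the system the speakers built: `MAX_ATTEMPTS`, `run_action`, `run_task`, the in-memory `failure_queues`, and the example actions are all assumptions, and the async actions simply run in a loop where a real system would enqueue them.

```python
from collections import defaultdict

MAX_ATTEMPTS = 3                     # assumed retry budget
failure_queues = defaultdict(list)   # per-task failure queue


def run_action(action, txn_id):
    """Repeat one action until it succeeds or the retry budget is spent."""
    for _ in range(MAX_ATTEMPTS):
        try:
            action(txn_id)
            return True
        except Exception:
            continue                 # actions are repeatable, so retrying is safe
    return False


def run_task(task_name, txn_id, sync_actions, async_actions):
    # Sync actions run in order; stop at the first one that cannot complete.
    for action in sync_actions:
        if not run_action(action, txn_id):
            failure_queues[task_name].append((txn_id, action.__name__))
            return False
    # Async actions have no ordering constraint between them.
    ok = True
    for action in async_actions:
        if not run_action(action, txn_id):
            failure_queues[task_name].append((txn_id, action.__name__))
            ok = False
    return ok


attempts = {"flaky": 0}
log = []


def write_pin(txn_id):
    log.append("write_pin")


def flaky_feed_write(txn_id):
    attempts["flaky"] += 1
    if attempts["flaky"] < 3:
        raise IOError("transient failure")
    log.append("feed")


def broken_email(txn_id):
    raise IOError("mail server down")


result = run_task("pin_create", 42, [write_pin], [flaky_feed_write, broken_email])
```

Here the flaky feed write succeeds on its third attempt, while the email action exhausts its budget and lands in the `pin_create` failure queue instead of failing silently.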
Transactional tasks example
• Pin create sync
• Write empty pin object
• Write pin ID to board, likes, user’s pins, clear caches
• Write pin object
• Pin not shown until pin object created -> Atomicity!
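The write ordering in this example can be sketched with in-memory dicts standing in for the real stores. Everything named here (`pins`, `board_pins`, `create_pin`, `visible_pins`, the `fail_before_commit` flag) is hypothetical; the point is only that readers ignore a pin ID until the final pin object write has happened, so a half-finished task is invisible and can be retried.

```python
pins = {}         # pin_id -> pin object (None while the pin is still empty)
board_pins = {}   # board_id -> list of pin IDs


def create_pin(pin_id, board_id, data, fail_before_commit=False):
    pins[pin_id] = None                         # 1. write empty pin object
    ids = board_pins.setdefault(board_id, [])
    if pin_id not in ids:                       # 2. write pin ID (repeatable)
        ids.append(pin_id)
    if fail_before_commit:
        raise RuntimeError("simulated crash before step 3")
    pins[pin_id] = data                         # 3. write pin object


def visible_pins(board_id):
    # Readers skip IDs whose pin object has not been written yet.
    return [p for p in board_pins.get(board_id, []) if pins.get(p) is not None]


try:
    create_pin(7, "recipes", {"url": "example.com/cake"}, fail_before_commit=True)
except RuntimeError:
    pass
hidden = visible_pins("recipes")                # half-written pin is not shown
create_pin(7, "recipes", {"url": "example.com/cake"})  # retry completes the task
shown = visible_pins("recipes")
```

The retry does not duplicate the pin ID on the board, because step 2 checks before appending.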
Transactional tasks example
• Pin create async
• Write pin to required user feeds and public feeds
• Feeds are sorted sets. Reinsertion is okay.
• Send emails, Facebook Likes, Twitter Tweets
• Before send, check / record in temporary storage -> Gives (temporary) repeatability
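Both async ideas can be sketched without a real datastore. A dict keyed by pin ID stands in for a sorted set (re-adding a member only overwrites its score), and a set stands in for the temporary storage that records sends. The names `feed`, `sent_markers`, `add_to_feed`, and `send_email_once`, the timestamp, and the email address are all made up for illustration, and recording before sending is an assumed ordering.

```python
feed = {}              # stands in for a sorted set: pin ID -> score
sent_markers = set()   # stands in for temporary storage with an expiry
outbox = []


def add_to_feed(pin_id, timestamp):
    # Re-adding a member to a sorted set only overwrites its score.
    feed[pin_id] = timestamp


def send_email_once(txn_id, recipient):
    key = (txn_id, recipient)
    if key in sent_markers:    # check before send
        return False
    sent_markers.add(key)      # record before send
    outbox.append(recipient)
    return True


for _ in range(2):             # simulate the same task running twice
    add_to_feed(7, 1384128000)  # arbitrary timestamp used as the score
    send_email_once("txn-1001", "follower@example.com")

ordered_feed = sorted(feed, key=feed.get, reverse=True)
```

Recording before sending means a crash between the two steps would drop the email rather than duplicate it; recording after sending makes the opposite trade. Either way the guarantee lasts only as long as the marker is kept, which matches the "(temporary)" qualifier on the slide.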